AI has the potential to touch and transform all aspects of our lives, and innovations built on AI are emerging today across a wide range of industries. These industries use AI to improve productivity, support consumer decision-making, and enhance the education experience. Running the complex AI workloads that deliver these results requires significant compute and data center power.
Today’s data centers already consume a great deal of power, and that consumption will only grow as AI deployments broaden and the underlying foundation models get larger. Arm is addressing this challenge by enabling additional AI capacity without adding to the energy problem. As generative AI and foundation models have gained popularity, deployments have been constrained by the limited availability of specialized compute hardware and its associated high cost. Larger models are also more resource-intensive, which exacerbates the problem. The rise of smaller language models and techniques such as quantization is encouraging developers to consider CPUs for machine learning. Smaller models are efficient and can be tailored to narrower, more specific applications, making them practical and cost-effective to deploy.
Arm’s latest Neoverse-based CPU platforms offer high-performance, power-efficient processors for cloud data centers. Arm Neoverse gives cloud providers the flexibility to customize their silicon and optimize their software and systems for the most demanding workloads, all while delivering leading performance and power efficiency. This is why all major cloud providers have adopted Arm Neoverse technology to design compute platforms that address developers’ needs across a wide range of cloud workloads, including AI and ML.
Popular open-source models from Hugging Face run efficiently and performantly on CPUs. Deploying models can be a time-consuming and challenging task, often requiring deep expertise in ML and in the underlying model code. Hugging Face pipelines abstract this complexity away and let developers use any model from the Hub for inference. Developers building AI applications and projects can benefit from the ease of provisioning cloud infrastructure, and from the power efficiency and cost savings associated with Arm-powered cloud instances.
CPUs have long benefited from using a single instruction to process multiple data points simultaneously, a technique known as SIMD, which provides data-level parallelism and performance gains. Arm Neoverse CPUs support advanced SIMD technologies such as NEON and SVE, which can accelerate common algorithms used in HPC and ML.
GEMM (General Matrix Multiplication) is an essential algorithm in machine learning that multiplies two input matrices to produce one output matrix. The Armv8.6-A architecture adds SMMLA and FMMLA instructions that perform these multiplications on a 2- or 4-wide array at a time, reducing fetch cycles by 2x to 4x and compute cycles by 4x to 16x. These instructions are implemented in several Arm-based server processors, including AWS Graviton3 and Graviton4, NVIDIA Grace, Google Axion, and Microsoft Cobalt.
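As a point of reference, the matrix multiplication that GEMM performs maps directly onto a PyTorch matmul. The sketch below is purely illustrative: the shapes are arbitrary, and whether a given call is actually lowered to MMLA-based kernels depends on the PyTorch build and the fast-math setting described later in this post.

import torch

# Two input matrices; shapes are arbitrary and chosen only for illustration
a = torch.randn(256, 512)
b = torch.randn(512, 128)

# A single GEMM: multiply the two inputs to produce one output matrix
c = a @ b
print(c.shape)  # torch.Size([256, 128])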
These key features benefit machine learning across many use cases.
With these ML inferencing capabilities, Arm Neoverse based AWS Graviton3 processors have achieved up to 3x better performance compared to previous-generation AWS Graviton2 processors. Let’s dive into a sentiment analysis use case.
Sentiment analysis is a vital AI technique that figures out emotions and opinions from written text. Businesses use it to grasp what customers think, evaluate how people perceive their brand, and shape marketing decisions. But running sentiment analysis models efficiently can be demanding on computational resources. This blog post dives into how Arm Neoverse CPUs can speed up sentiment analysis, resulting in quicker and more impactful AI-driven insights.
Specifically, we are going to focus on speeding up NLP PyTorch models (BERT, DistilBERT, and RoBERTa) on Arm Neoverse CPUs using the default PyTorch package available on pytorch.org. We will use the Hugging Face Transformers sentiment analysis pipeline to run these models.
Hugging Face Transformers simplify the use of their pre-trained models with a powerful tool called pipelines. These pipelines handle complexities behind the scenes, allowing you to focus on solving the actual problem.
For instance, if you want to analyze the sentiment of a piece of text, just input it into the pipeline. It will return a sentiment classification (positive or negative) without you having to worry about model loading, tokenization, or other technical details.
This bit of code uses the pipeline class to check how people feel about the input text. Behind the scenes, it uses a ready-made model from the Hugging Face Model Hub.
Code:
from transformers import pipeline

pipe = pipeline("sentiment-analysis")
data = ["I like the product a lot", "I wish I had not bought this"]
pipe(data)
Output:
[{'label': 'POSITIVE', 'score': 0.9997499585151672}, {'label': 'NEGATIVE', 'score': 0.9996662139892578}]
You can also specify a model of your choice using the model parameter.
pipe = pipeline("sentiment-analysis", model="distilbert-base-uncased")
When adding sentiment analysis to your existing application, it's important to consider latency. For real-time use cases, a response time of less than 100 milliseconds is typically perceived as instantaneous. However, higher latency may be acceptable for your specific needs.
We took two reviews, a short review (32 tokens when tokenized with BertTokenizer) and a long review (128 tokens when tokenized with BertTokenizer), and benchmarked them on AWS Graviton2 (c6g) and AWS Graviton3 (c7g).
Both AWS Graviton2 (c6g) and AWS Graviton3 (c7g) meet the ideal real-time latency target of 100 ms for short-review sentiment analysis with just 4 vCPUs, as can be seen in the graph below.
AWS Graviton3 (c7g) with BF16 enabled can also meet the ideal real-time latency target for longer-review sentiment analysis with 4 vCPUs. Arm Neoverse V1 based c7g instances provide up to a 3x performance boost compared to previous-generation c6g instances (based on Arm Neoverse N1 CPUs).
We conducted the benchmark tests on the following AWS EC2 instances:
c6g.xlarge
c7g.xlarge
Both instances have 4 vCPUs. We set them up with the same software and followed these setup steps.
sudo apt-get update
For further details on the installation process, refer to https://learn.arm.com/install-guides/pytorch/
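After installation, a quick sanity check from Python confirms that the build is working. This is a minimal sketch; the version printed will depend on what the install guide provides at the time you follow it.

import torch

print(torch.__version__)                      # confirm the installed PyTorch version
print(torch.randn(2, 3) @ torch.randn(3, 2))  # run a small matmul to confirm basic ops work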
Arm PyTorch Installation Guide (https://learn.arm.com/install-guides/pytorch/) and PyTorch Inference Tuning on AWS Graviton (https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html) provide a few tuning parameters for Arm.
For the benchmarking, we enabled bfloat16 fast math kernels on all platforms as shown below. On AWS Graviton3, this enables GEMM kernels that use bfloat16 MMLA instructions available in the hardware.
export DNNL_DEFAULT_FPMATH_MODE=BF16
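If you prefer to set this from Python rather than the shell, one option is to set the environment variable before PyTorch (and therefore oneDNN) is imported. The snippet below is a sketch of that approach, not part of the original benchmark script.

import os

# Must be set before importing torch so oneDNN picks it up when it initializes
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import torch
from transformers import pipeline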
We used two reviews: a short review and a long review.
short_review: "I'm extremely satisfied with my new Ikea Kallax; It's an excellent storage solution for our kids. A definite must have."

long_review: "We were in search of a storage solution for our kids, and their desire to personalize their storage units led us to explore various options. After careful consideration, we decided on the Ikea Kallax system. It has proven to be an ideal choice for our needs. The flexibility of the Kallax design allows for extensive customization. Whether it’s choosing vibrant colors, adding inserts for specific items, or selecting different finishes, the possibilities are endless. We appreciate that it caters to our kids’ preferences and encourages their creativity. Overall, the boys are thrilled with the outcome. A great value for money."
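If you want to verify the token counts quoted earlier, a small sketch using BertTokenizer is shown below. It assumes the two reviews above are assigned to Python string variables named short_review and long_review.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Length of the tokenized input, including the special tokens the tokenizer adds
print(len(tokenizer(short_review)["input_ids"]))
print(len(tokenizer(long_review)["input_ids"]))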
We evaluated three NLP models (distilbert-base-uncased, bert-base-uncased, and roberta-base) using the sentiment analysis pipeline.
For each model, we measured the execution time for both the short and the long review. In the benchmark function, we performed a warm-up phase (running the pipeline 100 times) to ensure consistent results, then measured the execution time of each run and calculated the mean and 99th-percentile latencies.
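A minimal sketch of such a benchmark function is shown below. The helper name, the warm-up count of 100, and the number of measured runs are illustrative; this is not the exact script used to produce the published numbers.

import time
import numpy as np
from transformers import pipeline

def benchmark(model_name, text, warmup_runs=100, measured_runs=100):
    # Build a sentiment-analysis pipeline for the model under test
    pipe = pipeline("sentiment-analysis", model=model_name)

    # Warm-up phase to ensure consistent results
    for _ in range(warmup_runs):
        pipe(text)

    # Measured phase: record per-run latency in milliseconds
    latencies = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        pipe(text)
        latencies.append((time.perf_counter() - start) * 1000)

    return {"mean_ms": float(np.mean(latencies)),
            "p99_ms": float(np.percentile(latencies, 99))}

for model_name in ["distilbert-base-uncased", "bert-base-uncased", "roberta-base"]:
    for label, review in [("short", short_review), ("long", long_review)]:
        print(model_name, label, benchmark(model_name, review))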
With AWS Graviton3, you can add sentiment analysis that meets stringent real-time latency requirements to your existing application with just 4 vCPUs.
AWS Graviton3, built on the Arm Neoverse V1 CPU with ML-focused capabilities such as the bfloat16 MMLA extension, delivers outstanding inference performance for Hugging Face sentiment analysis PyTorch models.
Feel free to try it with your own models. Depending on your model, you might need to tune performance. For this purpose, the following resources will be useful:
PyTorch Install Guide on learn.arm.com (https://learn.arm.com/install-guides/pytorch/).
PyTorch Inference Performance Tuning on AWS Graviton Processors (https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html).